6.1 - Text generation with Recurrent Neural Networks (RNN)

In this tutorial we will use the Keras deep learning library to construct a simple Recurrent Neural Network (RNN) that can learn linguistic structure from a piece of text and use that knowledge to generate new text passages. To review general RNN architecture, specific types of RNNs such as the LSTM networks we'll be using here, and other concepts behind this type of machine learning, you should consult the following resources:

This code is an adaptation of these two examples:

You can consult the original sites for more information and documentation.

Let's start by importing some of the libraries we'll be using in this lab:


In [ ]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

from time import gmtime, strftime
import os
import re
import pickle
import random
import sys

The first thing we need to do is generate our training data set. In this case we will use a recent article written by Barack Obama for The Economist newspaper. Make sure you have the obama.txt file in the /data folder within the /week-6 folder in your repository.


In [ ]:
# load ascii text from file
filename = "data/obama.txt"
raw_text = open(filename).read()

# get rid of any characters other than letters, numbers, 
# and a few special characters
raw_text = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', raw_text)

# convert all text to lowercase
raw_text = raw_text.lower()

n_chars = len(raw_text)
print("length of text:", n_chars)
print("text preview:", raw_text[:500])

Next, we use Python's set() function to generate a list of all unique characters in the text. This will form our 'vocabulary' of characters, which is similar to the set of categories found in a typical ML classification problem.

Since neural networks work with numerical data, we also need to create a mapping between each character and a unique integer value. To do this we create two dictionaries: one which has characters as keys and the associated integers as the value, and one which has integers as keys and the associated characters as the value. These dictionaries will allow us to do translation both ways.


In [ ]:
# extract all unique characters in the text
chars = sorted(list(set(raw_text)))
n_vocab = len(chars)
print("number of unique characters found:", n_vocab)

# create mapping of characters to integers and back
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# test our mapping
print('a', "- maps to ->", char_to_int["a"])
print(25, "- maps to ->", int_to_char[25])

Now we need to define the training data for our network. With RNNs, the training data usually takes the shape of a three-dimensional matrix, with the size of each dimension representing:

[# of training sequences, # of training samples per sequence, # of features per sample]

  • The training sequences are the sets of data presented to the RNN at each training step. As with all neural networks, these training sequences are fed to the network in small batches during training.
  • Each training sequence is composed of some number of training samples. The number of samples in each sequence determines how far back in the data stream the network can learn dependencies, and sets the number of time steps the RNN layer is unrolled over.
  • Each training sample within a sequence is composed of some number of features. This is the data the RNN layer learns from at each time step. In our example, the training samples and targets use one-hot encoding, so each has one feature per possible character, with the actual character represented by 1 and all others by 0 (see the short sketch after this list).
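
To make this concrete, here is a toy illustration (a hypothetical 5-character vocabulary, not the lab data) of how 3 sequences of 4 characters each would be one-hot encoded into a [3, 4, 5] array:


In [ ]:
# toy illustration: hypothetical 5-character vocabulary, not the lab data
toy_vocab = ['a', 'b', 'c', ' ', '.']           # 5 features per sample
toy_char_to_int = {c: i for i, c in enumerate(toy_vocab)}

toy_sequences = ['ab c', 'b.ca', 'c ab']        # 3 sequences, 4 samples each
toy_X = np.zeros((3, 4, 5), dtype=bool)         # [sequences, samples, features]

for i, seq in enumerate(toy_sequences):
    for t, char in enumerate(seq):
        toy_X[i, t, toy_char_to_int[char]] = 1

print('toy dims -->', toy_X.shape)
print('one-hot vector for first character -->', toy_X[0, 0])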

To prepare the data, we first set the length of training sequences we want to use. In this case we will set the sequence length to 100, meaning the RNN layer will be able to predict future characters based on the 100 characters that came before.

We will then slide this 100-character 'window' over the entire text to create input and output arrays. Each entry in the input array contains 100 characters from the text, and each entry in the output array contains the single character that came directly after.


In [ ]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100

inputs = []
outputs = []

for i in range(0, n_chars - seq_length, 1):
    inputs.append(raw_text[i:i + seq_length])
    outputs.append(raw_text[i + seq_length])
    
n_sequences = len(inputs)
print("Total sequences: ", n_sequences)

Now let's shuffle both the input and output data so that we can later have Keras split it automatically into training and test sets. To make sure the two lists are shuffled the same way (maintaining the correspondence between inputs and outputs), we create a separate shuffled list of indices, and use these indices to reorder both lists.


In [ ]:
indices = list(range(len(inputs)))
random.shuffle(indices)

inputs = [inputs[x] for x in indices]
outputs = [outputs[x] for x in indices]

Let's visualize one of these sequences to make sure we are getting what we expect:


In [ ]:
print(inputs[0], "-->", outputs[0])

Next we will prepare the actual numpy datasets which will be used to train our network. We first initialize two empty numpy arrays in the proper formatting:

  • X --> [# of training sequences, # of training samples, # of features]
  • y --> [# of training sequences, # of features]

We then iterate over the arrays we generated in the previous step and fill the numpy arrays with the proper data. Since all character data is formatted using one-hot encoding, we initialize both data sets with zeros. As we iterate over the data, we use the char_to_int dictionary to map each character to its integer position, and use that position to set the corresponding value in the data set to 1.


In [ ]:
# create two empty numpy arrays with the proper dimensions
X = np.zeros((n_sequences, seq_length, n_vocab), dtype=bool)
y = np.zeros((n_sequences, n_vocab), dtype=bool)

# iterate over the data and build up the X and y data sets
# by setting the appropriate indices to 1 in each one-hot vector
for i, example in enumerate(inputs):
    for t, char in enumerate(example):
        X[i, t, char_to_int[char]] = 1
    y[i, char_to_int[outputs[i]]] = 1
    
print('X dims -->', X.shape)
print('y dims -->', y.shape)

Next, we define our RNN model in Keras. This is very similar to how we defined the CNN model, except that we now use the LSTM() function to create an LSTM layer with an internal memory of 128 hidden units. LSTM is a special type of RNN layer that solves the unstable gradient problem seen in basic RNNs. Along with LSTM layers, Keras also supports basic RNN layers and GRU layers, which are similar to LSTM. You can find full details on recurrent layers in Keras' documentation.

As before, we need to explicitly define the input shape for the first layer. We also need to tell Keras whether the LSTM layer should pass its full sequence of outputs (one per time step) or only its final output to the next layer. If you are connecting the LSTM layer to a fully connected layer, as we do in this case, you should set the return_sequences parameter to False so the layer passes only its final output. If you are stacking multiple LSTM layers, you should set the parameter to True in all but the last LSTM layer, so that subsequent layers can learn from the full sequence of outputs of the layers before them, as in the short sketch below.
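
For example, if you wanted to stack two LSTM layers, the pattern would look like the following sketch (shown only to illustrate the return_sequences rule; it is not the model we train in this lab):


In [ ]:
# sketch of a stacked LSTM, shown only to illustrate return_sequences
# (not the model used in this lab)
stacked = Sequential()
stacked.add(LSTM(128, return_sequences=True, input_shape=(seq_length, n_vocab)))
stacked.add(LSTM(128, return_sequences=False))
stacked.add(Dropout(0.50))
stacked.add(Dense(n_vocab, activation='softmax'))
stacked.compile(loss='categorical_crossentropy', optimizer='adam')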

We will use dropout with a probability of 50% to regularize the network and prevent overfitting on our training data. The output of the network will be a fully connected layer with one neuron for each character in the vocabulary. The softmax function will convert this output to a probability distribution across all characters.


In [ ]:
# define the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=False, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.50))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
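
If you want to double-check the layer shapes and parameter counts of the model we just defined, you can print a summary:


In [ ]:
# print an overview of the model's layers, output shapes, and parameter counts
model.summary()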

Next, we define two helper functions: one to select a character based on a probability distribution, and one to generate a sequence of predicted characters based on an input (or 'seed') list of characters.

The sample() function will take in a probability distribution generated by the softmax function, and select a character index based on the 'temperature' input. The temperature (also often called the 'diversity') affects how strictly the probability distribution is sampled:

  • Lower values (closer to zero) produce more confident but also more conservative predictions. In our case, if the model has overfit the training data, lower values are likely to give back exactly what is found in the text.
  • Higher values (1 and above) introduce more diversity and randomness into the results. This can lead the model to generate novel information not found in the training data. However, you are also likely to see more errors such as grammatical or spelling mistakes.

In [ ]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
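
To get a feel for what the temperature does, you can call sample() on a small hand-made distribution (a toy example, not actual model output) and tally which index it picks:


In [ ]:
# toy demonstration of how temperature affects sampling
# (a hand-made probability distribution, not actual model output)
toy_preds = [0.6, 0.3, 0.1]

for temperature in [0.2, 1.0, 1.5]:
    picks = [sample(toy_preds, temperature) for _ in range(1000)]
    counts = [picks.count(i) for i in range(len(toy_preds))]
    print('temperature', temperature, '--> counts per index:', counts)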

The generate() function will take in:

  • an input sentence (the 'seed')
  • the number of characters to generate
  • the target diversity (temperature)

and print the resulting sequence of characters to the screen.


In [ ]:
def generate(sentence, prediction_length=50, diversity=0.35):
    print('----- diversity:', diversity) 

    generated = sentence
    sys.stdout.write(generated)

    # iterate over number of characters requested
    for i in range(prediction_length):
        
        # build up sequence data from current sentence
        x = np.zeros((1, X.shape[1], X.shape[2]))
        for t, char in enumerate(sentence):
            x[0, t, char_to_int[char]] = 1.

        # use trained model to return probability distribution
        # for next character based on input sequence
        preds = model.predict(x, verbose=0)[0]
        
        # use sample() function to sample next character 
        # based on probability distribution and desired diversity
        next_index = sample(preds, diversity)
        
        # convert integer to character
        next_char = int_to_char[next_index]

        # add new character to generated text
        generated += next_char
        
        # delete the first character from the beginning of the sentence, 
        # and add the new character to the end. This will form the 
        # input sequence for the next predicted character.
        sentence = sentence[1:] + next_char

        # print results to screen
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

Next, we define a system for Keras to save our model's parameters to a local file after each epoch in which it achieves an improvement in the overall loss. This will allow us to reuse the trained model at a later time without having to retrain it from scratch. This is useful for recovering models in case your computer crashes, or if you want to stop the training early.


In [ ]:
filepath="-basic_LSTM.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
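
Later, instead of retraining from scratch, you can restore the saved parameters into a model with the same architecture. A minimal sketch, assuming the checkpoint file written during training is present in the working directory:


In [ ]:
# minimal sketch: restore saved parameters in a later session
# (assumes the checkpoint file written during training exists on disk)
if os.path.isfile(filepath):
    model.load_weights(filepath)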

Now we are finally ready to train the model. We want to train the model over 50 epochs, but we also want to output some generated text after each epoch to see how our model is doing.

To do this we create our own loop to iterate over the epochs. Within the loop we first train the model for one epoch. Since all parameters are stored within the model, training one epoch at a time has exactly the same effect as training over a longer series of epochs in a single call. We also use the validation_split parameter of the fit() method to tell Keras to automatically split the data into 80% training data and 20% test data for validation. Remember to always shuffle your data if you will be using validation!

After each epoch is trained, we use the raw_text data to extract a new sequence of 100 characters as the 'seed' for our generated text. Finally, we use our generate() helper function to generate text using two different diversity settings.

Warning: because of their large depth (remember that an RNN trained on sequences 100 characters long is effectively 100 layers deep!), these networks typically take much longer to train than traditional multi-layer ANNs and CNNs. You should expect these models to train overnight on the virtual machine, but you should be able to see enough progress after the first few epochs to know whether it is worth training a model to the end. For more complex RNN models with larger data sets in your own work, you should consider a native installation, along with a dedicated GPU if possible.


In [ ]:
epochs = 50
prediction_length = 100

for iteration in range(epochs):
    
    print('epoch:', iteration + 1, '/', epochs)
    model.fit(X, y, validation_split=0.2, batch_size=256, epochs=1, callbacks=callbacks_list)
    
    # get random starting point for seed
    start_index = random.randint(0, len(raw_text) - seq_length - 1)
    # extract seed sequence from raw text
    seed = raw_text[start_index: start_index + seq_length]
    
    print('----- generating with seed:', seed)
    
    for diversity in [0.5, 1.2]:
        generate(seed, prediction_length, diversity)

That looks pretty good! You can see that the RNN has learned a lot of the linguistic structure of the original writing, including typical word lengths, where to put spaces, and basic punctuation with commas and periods. Many words are still misspelled but seem almost reasonable, and it is pretty amazing that the model is able to learn this much in only 50 epochs of training.

You can see that the loss is still going down after 50 epochs, so the model can definitely benefit from longer training. If you're curious you can try training for more epochs, but as the error decreases be careful to monitor the output to make sure that the model is not overfitting. As with other neural network models, you can monitor the difference between training and validation loss to see if overfitting might be occurring. In this case, since we're using the model to generate new information, we can also get a sense of overfitting from the material it generates.

A good indication of overfitting is if the model outputs exactly what is in the original text when given a seed from the text, but gibberish when given a seed that is not in the original text. Remember, we don't want the model to learn to reproduce the original text exactly, but to learn its style so it can generate new text. As with other models, regularization methods such as dropout and limiting model complexity can be used to avoid overfitting.
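
One quick way to probe for this after training is to hand generate() a seed of your own that does not appear in the article, and see whether the output still reads like plausible language. A rough sketch (the hypothetical seed is cleaned the same way as the training text and padded with spaces so it is exactly seq_length characters long; it assumes all of its characters appear in the vocabulary):


In [ ]:
# rough sketch: probe the trained model with a seed that is not in the corpus
# (the seed must be exactly seq_length characters long, lowercase, and contain
#  only characters that appear in the vocabulary)
custom_seed = "the most important question facing the country today is"
custom_seed = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', custom_seed.lower())
custom_seed = custom_seed.rjust(seq_length)[-seq_length:]
generate(custom_seed, prediction_length=100, diversity=0.5)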

Finally, let's save our training data and character-to-integer mapping dictionaries to an external file so we can reuse them with the model at a later time.


In [ ]:
pickle_file = '-basic_data.pickle'

try:
    f = open(pickle_file, 'wb')
    save = {
        'X': X,
        'y': y,
        'int_to_char': int_to_char,
        'char_to_int': char_to_int,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise
    
statinfo = os.stat(pickle_file)
print('Saved data to', pickle_file)
print('Pickle size:', statinfo.st_size)
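
In a later session you can load this data back in with pickle. A minimal sketch, assuming the pickle file saved above is in the working directory:


In [ ]:
# minimal sketch: reload the saved data in a later session
# (assumes the pickle file saved above is in the working directory)
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)

X = save['X']
y = save['y']
int_to_char = save['int_to_char']
char_to_int = save['char_to_int']
print('reloaded X dims -->', X.shape)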